Beyond TFIDF Weighting for Text Categorization in the Vector Space Model

Authors

  • Pascal Soucy
  • Guy W. Mineau
Abstract

KNN and SVM are two machine learning approaches to Text Categorization (TC) based on the Vector Space Model. In this model, borrowed from Information Retrieval (IR), each document is represented as a vector in which each component is associated with a particular word from the vocabulary. Traditionally, each component value is assigned using the information retrieval TFIDF measure. While this weighting method seems very appropriate for IR, it is not clear that it is the best choice for TC problems. In particular, this weighting method does not leverage the information implicitly contained in the categorization task to represent documents. In this paper, we introduce a new weighting method based on statistical estimation of the importance of a word for a specific categorization problem. This method also has the benefit of making feature selection implicit, since features that are useless for the categorization problem considered get a very small weight. Extensive experiments reported in the paper show that this new weighting method significantly improves classification accuracy, as measured on many categorization tasks.
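As a rough illustration of the baseline the abstract refers to, the sketch below builds TFIDF document vectors in plain Python. It assumes one common variant of the measure (raw term frequency multiplied by log(N/df)); the helper name `tfidf_vectors` and the toy documents are hypothetical, and the sketch does not implement the new weighting method proposed in the paper, which the abstract does not specify.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TFIDF vectors for tokenized documents (hypothetical helper).

    docs: list of token lists.
    Returns (vocabulary, vectors), one weight per vocabulary word per document.
    Assumed variant: weight = raw term frequency * log(N / document frequency).
    """
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(token for doc in docs for token in set(doc))
    idf = {token: math.log(n_docs / df[token]) for token in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[token] * idf[token] for token in vocab])
    return vocab, vectors

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks fell on monday".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

Note that the weights depend only on word counts across the collection: no category label enters the computation, which is the limitation the abstract points out for TC problems.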

Similar resources

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

Empirical Evaluation of Centroid-based Models for Single-label Text Categorization

Centroid-based models have been used in Text Categorization because, despite their computational simplicity, they show a robust behavior and good performance. In this paper we experimentally evaluate several centroid-based models on single-label text categorization tasks. We also analyze document length normalization and two different term weighting schemes. We show that: (1) Document length nor...

An Integrated and Improved Approach to Terms Weighting in Text Classification

Traditional text classification methods utilize term frequency (tf) and inverse document frequency (idf) as the main method for information retrieval. Term weighting has been applied to achieve high performance in text classification. Although TFIDF is a popular method, it does not use class information. This paper provides an improved approach for supervised weighting in the TFIDF model. The t...

Clustering-based Method for Positive and Unlabeled Text Categorization Enhanced by Improved TFIDF

PU learning occurs frequently in Web page classification and text retrieval applications because users may be interested in information on the same topic. Collecting reliable negative examples is a key step in PU (Positive and Unlabeled) text classification, which solves a key problem in machine learning when no labeled negative examples are available in the training set or negative examples a...

Using Class Frequency for Improving Centroid-based Text Classification

Most previous work on text classification represents the importance of terms by term occurrence frequency (tf) and inverse document frequency (idf). This paper presents ways to apply class frequency in centroid-based text categorization. Three approaches are taken into account. The first one is to explore the effectiveness of inverse class frequency on the popular term weighting, i.e., TFIDF...

Publication date: 2005